Abstract

We propose a novel method for improving word 
alignments in a parallel sentence-aligned bilin- 
gual corpus based on the idea that if two words 
are translations of each other then so should 
be many words in their local contexts. The 
idea is formalised using the Web as a corpus, 
a glossary of known word translations (dynami- 
cally augmented from the Web using bootstrap- 
ping), the vector space model, linguistically mo- 
tivated weighted minimum edit distance, com- 
petitive linking, and the IBM models. Evalua- 
tion results on a Bulgarian-Russian corpus show 
a sizable improvement both in word alignment 
and in translation quality. 
1 Introduction

The beginning of modern Statistical Machine Translation
(SMT) can be traced back to 1988, when Brown et
al. [5] from IBM published a formalised mathematical 
formulation of the translation problem and proposed 
five word alignment models - IBM models 1, 2, 3, 4 and 
5. Starting with a bilingual parallel sentence-aligned 
corpus, the IBM models learn how to translate individ- 
ual words and the probabilities of these translations. 
Later, decoders like the ISI ReWrite Decoder [9] 
became available, which made it possible to quickly 
build SMT systems with decent quality. 

An important shift happened in 2004, when the
Pharaoh model [11] was proposed, which uses
whole phrases (typically of length up to 7, not neces- 
sarily representing linguistic units), rather than just 
words. This led to a significant improvement in trans- 
lation quality, since phrases can encode local gen- 
der/number agreement, facilitate choosing the correct 
sense for ambiguous words, and naturally handle fixed 
phrases and idioms. While methods have been pro- 
posed for learning translation phrases directly [17], 
the most popular alignment template approach [23] re- 
quires bi-directional word alignments at the sentence 
level from which phrases consistent with those alignments
are extracted. Since better word alignments can lead to
better phrases^1, improving word alignments remains one of
the primary research problems in SMT: in fact, there are
more papers published yearly on word alignments than on
any other SMT subproblem.

In the present paper, we describe a novel method for 
improving word alignments using the Web as a corpus, 
a glossary of known word translations (dynamically 
augmented from the Web using bootstrapping), the 
vector space model, weighted minimum edit distance, 
competitive linking, and the IBM models. The poten- 
tial of the method is demonstrated on a Bulgarian- 
Russian bilingual corpus. 

The rest of the paper is organised as follows: section 
2 explains the method in detail, section 3 describes 
the corpus and the resources used, section 4 contains 
the evaluation, section 5 points to important related 
research, and section 6 concludes with some possible 
directions for future work. 



2 Method 

Our method combines two similarity measures which 
make use of different information sources. First, we 
define a language-specific modified minimum edit dis- 
tance, based on linguistically-motivated rules target- 
ing Bulgarian-Russian cognate pairs. Second, we 
define a distributional semantic similarity measure, 
based on the idea that if two words represent a translation
pair, then the most frequently co-occurring words
in their local contexts should be translations of each 
other as well. This intuition is formalised using the 
Web as a corpus, a bilingual glossary of word trans- 
lation pairs used as "bridges", and the vector space 
model. The two measures are combined with com- 
petitive linking [19] in order to obtain high quality 
word translation pairs, which are then appended to 
the bilingual sentence-aligned corpus in order to bias 
the subsequent training of the IBM word alignment 
models [5]. 



2.1 Orthographic Similarity 

We use an orthographic similarity measure, which is based
on the minimum edit distance (MED) or Levenshtein distance
[16]. MED calculates the distance between two strings s1
and s2 as the minimum number of edit operations (INSERT,
REPLACE, DELETE) needed to transform s1 into s2. For
example, the MED between r. первый (Russian, 'first') and
b. първият (Bulgarian, 'the first') is 4: three REPLACE
operations (е → ъ, ы → и, й → я) and one INSERT (of т).

^1 The dependency between word alignments and translation
quality is indirect; improving the former does not
necessarily improve the latter.

We modify the classic med in two ways. First, we 
normalise the two strings, taking into account some 
general graphemic correlations between the phonetico- 
graphemic systems of the two closely-related Slavonic 
languages - Bulgarian and Russian: 

• For Russian words, we remove the letters ь and ъ, as
their graphemic collocations are excluded in Bulgarian,
e.g. ь between two consonants (r. сильно → b. силно,
strongly), ъ following a consonant (r. объявление →
b. обявление, an announcement), etc.

• For Russian words, we remove the ending й, which is the
typical nominative adjective ending in Russian, but not
in Bulgarian, e.g. r. детский → b. детски (children's).

• For Bulgarian words, we remove the definite article,
e.g. b. горският (the forestal) → b. горски (forestal).
The definite article is the only agglutinative morpheme
in Bulgarian and has no counterpart in Russian: Bulgarian
has a definite, but no indefinite article, and there are
no articles in Russian.

• We transliterate the Russian-specific letters (missing
in the Bulgarian alphabet) or letter combinations in a
regular way: ы → и, э → е, and шт → щ, e.g. r. электрон →
b. електрон (an electron), r. выл → b. вил (past
participle of to howl), r. штаб → b. щаб (mil. a staff),
etc.

• Finally, we remove all double letters in both languages
(e.g. нн → н; сс → с). While consonant and vowel doubling
is very rare in Bulgarian (except at morpheme boundaries
for a limited number of morphemes), it is more common in
Russian, e.g. in the case of words of foreign origin:
r. ассамблея → b. асамблея (an assembly).

Second, we use different letter-pair specific costs for
REPLACE. We use 0.5 for all vowel to vowel substitutions,
e.g. о → е as in r. лицо → b. лице (a face). We also use
0.5 for some consonant-consonant replacements, e.g. с → з.
Such regular phonetic changes are reflected in different
ways in the orthographic systems of the two languages,
Bulgarian being more conservative and sticking to
morphological principles. For example, in Bulgarian the
final з in prefixes like из- and раз- never changes to с,
while in Russian it sometimes does, e.g.
r. исследователь → b. изследовател (an explorer),
r. рассказ → b. разказ (a story), etc.

We use a cost of 1 for all other replacements. 

It is easy to see that this modified minimum edit distance
(MMED) is more adequate than MED: it is only 0.5 for
r. первый and b. първият. We first normalise them to перви
and първи, and then we do a single vowel-vowel REPLACE
with the cost of 0.5.
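The normalisation rules and the letter-pair costs above can be sketched as follows. This is only an illustrative approximation: the vowel set, the list of Bulgarian article suffixes (`BG_ARTICLES`) and the single cheap consonant pair are assumptions made for the example, since the text names only some of the rules explicitly, and stripping articles by suffix is much cruder than a dictionary-based treatment would be.

```python
import re

# Assumed vowel inventory covering both alphabets (ъ is a vowel in Bulgarian).
VOWELS = set("аеиоуъюяыэё")
# Only the pair с/з is named in the text; further pairs would be added here.
CHEAP_CONSONANT_PAIRS = {frozenset("сз")}
# Hypothetical, simplified list of Bulgarian definite-article suffixes.
BG_ARTICLES = ("ят", "ът", "та", "то", "те")


def collapse_doubles(word: str) -> str:
    """Replace any run of the same letter with a single letter."""
    return re.sub(r"(.)\1+", r"\1", word)


def normalise_ru(word: str) -> str:
    word = word.replace("ь", "").replace("ъ", "")   # drop ь and ъ
    if word.endswith("й"):                          # drop adjectival ending й
        word = word[:-1]
    word = word.replace("шт", "щ").replace("ы", "и").replace("э", "е")
    return collapse_doubles(word)


def normalise_bg(word: str) -> str:
    for article in BG_ARTICLES:
        # The length guard is an ad hoc safeguard against stripping short words.
        if word.endswith(article) and len(word) > len(article) + 2:
            word = word[: -len(article)]
            break
    return collapse_doubles(word)


def replace_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    if frozenset((a, b)) in CHEAP_CONSONANT_PAIRS:
        return 0.5
    return 1.0


def mmed(ru_word: str, bg_word: str) -> float:
    """Weighted edit distance between the normalised forms of the two words."""
    s1, s2 = normalise_ru(ru_word), normalise_bg(bg_word)
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, start=1):
        curr = [float(i)]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1.0,                       # DELETE
                            curr[j - 1] + 1.0,                   # INSERT
                            prev[j - 1] + replace_cost(a, b)))   # REPLACE / match
        prev = curr
    return prev[-1]


print(mmed("первый", "първият"))  # the worked example from the text
```

With these assumptions, `mmed("первый", "първият")` compares the normalised forms перви and първи and pays for one vowel-vowel replacement.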

We transform MMED into a similarity measure, modified
minimum edit distance ratio (MMEDR), using the following
formula (|s| is the number of letters in s before the
normalisation):

$$\mathrm{MMEDR}(s_1, s_2) = 1 - \frac{\mathrm{MMED}(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

Below we compare MMEDR with minimum edit distance ratio
(MEDR):

$$\mathrm{MEDR}(s_1, s_2) = 1 - \frac{\mathrm{MED}(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

and longest common subsequence ratio (LCSR) [18]:

$$\mathrm{LCSR}(s_1, s_2) = \frac{|\mathrm{LCS}(s_1, s_2)|}{\max(|s_1|, |s_2|)}$$

In the last definition, LCS(s1, s2) refers to the longest
common subsequence of s1 and s2, e.g.
LCS(первый, първият) = прв, and therefore

LCSR(первый, първият) = 3/7 ≈ 0.43

We obtain the same score using MEDR:

MEDR(первый, първият) = 1 - 4/7 ≈ 0.43

while with MMEDR we have:

MMEDR(первый, първият) = 1 - 0.5/7 ≈ 0.93
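The three scores in this worked example can be checked with a short sketch. The function names are illustrative; `ratio_from_distance` takes an already computed distance (here the MMED value of 0.5 stated in the text), so the block does not depend on any particular normalisation code.

```python
def med(s1: str, s2: str) -> int:
    """Classic minimum edit distance with unit costs."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        curr = [i]
        for j, b in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,              # DELETE
                            curr[j - 1] + 1,          # INSERT
                            prev[j - 1] + (a != b)))  # REPLACE or free match
        prev = curr
    return prev[-1]


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        curr = [0]
        for j, b in enumerate(s2, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ratio_from_distance(distance: float, s1: str, s2: str) -> float:
    """1 - distance / max(|s1|, |s2|), lengths taken before normalisation."""
    return 1.0 - distance / max(len(s1), len(s2))


def medr(s1: str, s2: str) -> float:
    return ratio_from_distance(med(s1, s2), s1, s2)


def lcsr(s1: str, s2: str) -> float:
    return lcs_length(s1, s2) / max(len(s1), len(s2))


ru, bg = "первый", "първият"
print(round(lcsr(ru, bg), 2))                      # LCSR
print(round(medr(ru, bg), 2))                      # MEDR
print(round(ratio_from_distance(0.5, ru, bg), 2))  # MMEDR, given MMED = 0.5
```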

2.2 Semantic Similarity 

The second basic similarity measure we use is WEB-ONLY,
which measures the semantic similarity between a Russian
word w_ru and a Bulgarian word w_bg using the Web as a
corpus and a glossary G of known Bulgarian-Russian
translation pairs used as "bridges". The basic idea is
that if two words are translations of each other then many
of the words in their respective local contexts should be
mutual translations as well.

First, we issue a query to Google for w_ru or w_bg,
limiting the language to Russian or Bulgarian, and we
collect the text from the resulting 1,000 snippets. We
then extract the words from the local context (two words
on either side of the target word), we remove the
stopwords (prepositions, pronouns, conjunctions,
interjections and some adverbs), we lemmatise the
remaining words, and we filter out the words that are not
in G. We further replace each Russian word with its
Bulgarian counterpart in G. As a result, we end up with
two Bulgarian frequency vectors, corresponding to w_ru and
w_bg, respectively. Finally, we TF.IDF-weight the vector
coordinates [31] and we calculate the semantic similarity
between w_bg and w_ru as the cosine between their
corresponding vectors.
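A minimal sketch of this pipeline is given below, under several simplifying assumptions: the snippets are already tokenised and lemmatised, each Russian glossary word has a single Bulgarian translation, and the IDF weights are passed in as a plain dictionary (the text does not say how they are estimated). No search engine is queried; the toy snippets and all function names are invented for illustration.

```python
import math
from collections import Counter


def context_counts(snippets, target, window=2):
    """Count words within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for tokens in snippets:
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


def bridged_vector(counts, bridge, stopwords, idf):
    """Drop stopwords and non-glossary words, map to Bulgarian, apply TF.IDF."""
    tf = Counter()
    for word, freq in counts.items():
        if word not in stopwords and word in bridge:
            tf[bridge[word]] += freq
    return {word: freq * idf.get(word, 1.0) for word, freq in tf.items()}


def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data, invented for illustration only.
ru_snippets = [["она", "надела", "свадебное", "платье", "и", "туфли"]]
bg_snippets = [["тя", "облече", "сватбена", "рокля", "и", "обувки"]]
ru_to_bg = {"свадебное": "сватбена", "туфли": "обувки"}  # glossary G as bridges
bg_to_bg = {bg: bg for bg in ru_to_bg.values()}           # Bulgarian maps to itself
stopwords = {"и", "она", "тя"}
idf = {}  # no collection statistics here, so every IDF weight defaults to 1.0

v_ru = bridged_vector(context_counts(ru_snippets, "платье"), ru_to_bg, stopwords, idf)
v_bg = bridged_vector(context_counts(bg_snippets, "рокля"), bg_to_bg, stopwords, idf)
web_only = cosine(v_ru, v_bg)
print(web_only)
```

In this toy case both context vectors end up with the same two Bulgarian coordinates, so the cosine is at its maximum; with real snippets the vectors overlap only partially.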

2.3 Combined Similarity Measures 

In our experiments (see below), we have found that MMEDR
yields a better precision, while WEB-ONLY has a better
recall. Therefore we tried to combine the two similarity
measures in different ways:

• WEB-AVG: average of WEB-ONLY and MMEDR;

• WEB-MAX: maximum of WEB-ONLY and MMEDR;

• WEB-CUT: the value of WEB-CUT(s1, s2) is 1, if
MMEDR(s1, s2) > α (0 < α < 1), and is equal to
WEB-ONLY(s1, s2), otherwise.
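The three combinations are simple enough to write down directly. In the sketch below both inputs are assumed to be precomputed scores in [0, 1]; the default cut-off of 0.62 is the value reported as best in section 4.1, and the function names are illustrative.

```python
def web_avg(web_only: float, mmedr: float) -> float:
    """Average of the semantic and the orthographic score."""
    return (web_only + mmedr) / 2.0


def web_max(web_only: float, mmedr: float) -> float:
    """Maximum of the semantic and the orthographic score."""
    return max(web_only, mmedr)


def web_cut(web_only: float, mmedr: float, alpha: float = 0.62) -> float:
    """1 for orthographically close pairs, the semantic score otherwise."""
    return 1.0 if mmedr > alpha else web_only


# An orthographically close pair is trusted outright ...
print(web_cut(web_only=0.30, mmedr=0.93))
# ... while a dissimilar pair falls back to the Web-based score.
print(web_cut(web_only=0.30, mmedr=0.40))
```

The design intent suggested by the text is that the high-precision measure (MMEDR) decides first, and the high-recall measure (WEB-ONLY) is consulted only for the remaining pairs.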



2.4 Competitive Linking 

The above similarity measures are used in combination with
competitive linking [19], which assumes that a source word
is either translated with a single target word or is not
translated at all. Given a sentence pair, the similarity
between all Bulgarian-Russian word pairs is calculated^2,
which induces a fully-connected weighted bipartite graph.
Then a greedy approximation to the maximum weighted
bipartite matching in that graph is extracted as follows:
first, the most similar pair of unaligned words is aligned
and both words are discarded from further consideration.
Then the next most similar pair of unaligned words is
aligned and the two words are discarded, and so forth. The
process is repeated until there are no unaligned words
left or until the maximal word pair similarity falls below
a pre-specified threshold θ (0 < θ < 1), which could leave
some words unaligned.
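The greedy procedure can be sketched as follows. The similarity function is passed in as a parameter, so any of the measures above could be plugged in; the toy scores in the usage example are invented, and the filtering of stopwords and short words is left out.

```python
from typing import Callable, List, Sequence, Tuple


def competitive_linking(
    src: Sequence[str],
    tgt: Sequence[str],
    sim: Callable[[str, str], float],
    theta: float = 0.0,
) -> List[Tuple[int, int, float]]:
    """Greedy one-to-one alignment: repeatedly link the most similar free pair."""
    candidates = sorted(
        ((sim(s, t), i, j) for i, s in enumerate(src) for j, t in enumerate(tgt)),
        key=lambda c: -c[0],
    )
    used_src, used_tgt, links = set(), set(), []
    for score, i, j in candidates:
        if score < theta:          # everything that follows is even less similar
            break
        if i in used_src or j in used_tgt:
            continue               # one of the two words is already aligned
        links.append((i, j, score))
        used_src.add(i)
        used_tgt.add(j)
    return links


# Toy similarity table (invented scores).
scores = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.85, ("b", "y"): 0.1}
links = competitive_linking(["a", "b"], ["x", "y"], lambda s, t: scores[(s, t)], theta=0.5)
print(links)
```

In the toy example the greedy choice links a with x first, which leaves b unaligned because its only remaining candidate scores below the threshold. An exact maximum weight matching would instead pick a-y and b-x, which illustrates why the procedure is only an approximation.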



3 Resources 

3.1 Parallel Corpus 

We use a parallel sentence-aligned Bulgarian-Russian
corpus: the Russian novel Lord of the World^3 by
Alexander Beliaev and its Bulgarian translation^4. The
text has been sentence aligned automatically using the
alignment tool MARK ALISTeR [26], which is based on the
Gale-Church algorithm [8]. As a result, we obtained 5,827
parallel sentences, which we divided into training (4,827
sentences), tuning (500 sentences), and testing (500
sentences) sets.



3.2 Grammatical Resources

We use monolingual dictionaries for lemmatisation. For
Bulgarian, we use a large morphological dictionary,
containing about 1,000,000 wordforms and 70,000 lemmata
[25], created at the Linguistic Modeling Department,
Institute for Parallel Processing, Bulgarian Academy of
Sciences. The dictionary is in DELAF format [30]: each
entry consists of a wordform, a corresponding lemma,
followed by morphological and grammatical information.
There can be multiple entries for the same wordform, in
case of multiple homographs. We also use a large
grammatical dictionary of Russian in the same format,
consisting of 1,500,000 wordforms and 100,000 lemmata,
based on the Grammatical Dictionary of A. Zaliznjak [33].
Its electronic version was supplied by the Computerised
fund of Russian language, Institute of Russian language,
Russian Academy of Sciences.

3.3 Bilingual Glossary

We built a bilingual glossary from an online
Bulgarian-Russian dictionary^5. First, we removed all
multi-word expressions. Then we combined each Russian word
with each Bulgarian one: due to polysemy/homonymy some
words had multiple translations. As a result, we obtained
a glossary G of 3,794 word translation pairs.

Due to the modest glossary size, in our initial
experiments, we were lacking translations for many of the
most frequent context words. For example, when comparing
r. платье (a dress) and b. рокля (a dress), we find
adjectives like r. свадебное (wedding) and r. вечернее
(evening) among the most frequent Russian context words,
and b. сватбена and b. вечерна among the most frequent
Bulgarian context words. While missing in our bilingual
glossary, it is easy to see that they are orthographically
similar and thus likely cognates. Therefore, we
automatically extended G with possible cognate pairs. For
this purpose, we collected the most frequent 30
non-stopwords RU_30 and BG_30 from the local contexts of
w_ru and w_bg, respectively, that were missing in our
glossary. We then calculated the MMEDR for every word pair
(r, b) with r in RU_30 and b in BG_30, and we added to G
all pairs for which the value was above 0.90. As a result,
we managed to extend G with 6,289 additional high-quality
translation pairs.

^2 Due to their special distribution, stopwords and short
words (one or two letters) are not used in competitive
linking.
^3 http://www.lib.ru
^4 http://borislav.free.fr/mylib
^5 http://www.bgru.net/intr/dictionary/

4 Evaluation

We evaluate the similarity measures in four different
ways: manual analysis of WEB-CUT, alignment quality of
competitive linking, alignment quality of the IBM models
for a corpus augmented with word translations from
competitive linking, and translation quality of a
phrase-based SMT trained on that corpus.
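As a rough illustration of the glossary extension step from section 3.3 (pairing the most frequent uncovered context words and keeping orthographically close pairs), the sketch below takes the similarity function as a parameter. The glossary is modelled as a set of (Russian, Bulgarian) pairs, the context counts are assumed to be stopword-free already, and the toy scores are invented; none of the names come from the paper.

```python
from collections import Counter
from itertools import product


def top_missing(context_counts: Counter, known_words: set, n: int = 30) -> list:
    """The n most frequent context words that the glossary does not cover."""
    ranked = [word for word, _ in context_counts.most_common()]
    return [word for word in ranked if word not in known_words][:n]


def extend_glossary(glossary: set, ru_top, bg_top, similarity, threshold: float = 0.90) -> set:
    """Add every (r, b) pair whose similarity exceeds the threshold."""
    added = {(r, b) for r, b in product(ru_top, bg_top) if similarity(r, b) > threshold}
    return glossary | added


# Toy data: counts and similarity scores are invented for illustration.
glossary = {("платье", "рокля")}
ru_counts = Counter({"свадебное": 4, "вечернее": 3, "платье": 2})
bg_counts = Counter({"сватбена": 5, "вечерна": 2, "рокля": 1})
toy_scores = {("свадебное", "сватбена"): 0.95, ("вечернее", "вечерна"): 0.92}

ru_top = top_missing(ru_counts, {r for r, _ in glossary})
bg_top = top_missing(bg_counts, {b for _, b in glossary})
extended = extend_glossary(glossary, ru_top, bg_top,
                           lambda r, b: toy_scores.get((r, b), 0.1))
print(sorted(extended))
```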



4.1 Manual Evaluation of WEB-CUT 

Recall that by definition WEB-CUT(s1, s2) is 1, if
MMEDR(s1, s2) > α, and is equal to WEB-ONLY(s1, s2),
otherwise. To find the best value for α, we tried all
values α ∈ {0.01, 0.02, 0.03, ..., 0.99}. For each value,
we word-aligned the training sentences from the parallel
corpus using competitive linking and WEB-CUT, and we
extracted a list of the distinct aligned word pairs, which
we added twice as additional "sentence" pairs to the
training corpus. We then calculated the perplexity of IBM
model 4 for that augmented corpus. This procedure was
repeated for all candidate values of α, and finally
α = 0.62 was selected as it yielded the lowest
perplexity.^6

The last author, a native speaker of Bulgarian who is
fluent in Russian, manually examined and annotated as
correct, rough or wrong the 14,246 distinct aligned
Bulgarian-Russian word type pairs, obtained with
competitive linking and WEB-CUT for α = 0.62. The
following groups naturally emerge:

1. "Identical" word pairs (MMEDR(s1, s2) = 1): 1,309 or 9%
of all pairs. 70% of them are completely identical, e.g.
скоро (soon) is spelled the same way in both Bulgarian and
Russian. The remaining 30% exhibit regular graphemic
changes, which are recognised by MMEDR (see section 2.1).

2. "True friends" (α < MMEDR(s1, s2) < 1): 5,289 or 37% of
all pairs. This group reflects changes combining regular
phonemic and morphemic (grammatical) correlations.
Examples include similar but not identical affixes (e.g.
the Russian prefixes во- and со- become въ- and съ- in
Bulgarian), similar graphemic shapes of morpheme values
(e.g. the Russian singular feminine adjective endings -ая
and -яя become -а and -я in Bulgarian), etc.

3. "Translations" (MMEDR(s1, s2) < α): 7,648 or 54% of all
pairs. Here the value of WEB-ONLY(s1, s2) is used. We
divide this group into the following sub-categories:
correct (73%), rough (3%) and wrong (24%).

^6 This value is close to 0.58, which has been found to
perform best for LCSR on Western-European languages [15].

Our analysis of the rough and wrong sub-groups of the
latter group exposes the inadequacy of the idea of
reducing sentence translation to a sequence of
word-for-word translations, even for closely related
languages like Bulgarian and Russian. Laying aside the
translator's freedom of choice, the translation
correspondences often link a word to a phrase, or a phrase
to another phrase, often idiomatically, and sometimes
involve syntactic transformations as well. For example,
when aligning the Russian word r. отвернуться to its
Bulgarian translation b. обръщам гръб (to turn back),
competitive linking wrongly aligns r. отвернуться to
b. гръб (a back). Similarly, when the Russian for to
challenge, r. бросать вызов (lit. to throw a challenge),
is aligned to its Bulgarian translation b. хвърлям
ръкавица (lit. to throw a glove), this results in wrongly
aligning r. вызов (a challenge) to b. ръкавица (a glove).
Note however that such alignments are still helpful in the
context of SMT.

Figure 1 shows the precision-recall curve for the manual
evaluation of competitive linking with WEB-CUT for the
third group only (MMEDR(s1, s2) < α), considering both
rough and wrong as incorrect. We can see that the
precision is 0.73 even for recall of 1.




Fig. 1: Manual evaluation of WEB-CUT: Precision-recall
curve for competitive linking with WEB-CUT on the
"translations" sub-group (MMEDR(s1, s2) < 0.62).



4.2 Word Alignments 

4.2.1 Gold Standard Word Alignments 

The last author, a linguist, manually aligned the first
100 sentences from the training corpus, thus creating a
gold standard for calculating the alignment error rate
(AER) for the different similarity measures.

Manual alignments typically use two kinds of links: sure
and possible. As we have seen above, even for closely
related languages like Russian and Bulgarian, the
alignment of each source word to a target one could be
impossible, unless a suitable convention is adopted.
Particularly problematic are the "hanging" single words,
typically stemming from syntactic differences. We prefer
to align such a word to the same target word as the word
it depends on, and to mark the link as possible, rather
than sure. More formally, if the source Russian word x_ru
is translated with a pair of target Bulgarian words x_bg
and y_bg, where x_ru is a sure translation of x_bg, and
y_bg is a grammatical or "empty" word ensuring the correct
surface presentation of the grammatical/lexical relation,
then we add a possible link between y_bg and x_ru as well.

For instance, the Russian genitive case is typically
translated in Bulgarian with a prepositional phrase,
на + noun, e.g. r. звуки музыки (sounds of music) is
translated as b. звуците на музиката. Other examples
include regular ellipsis/dropping of elements specific for
one of the languages only, e.g. subject dropping in
Bulgarian, ellipsis of Russian auxiliaries in present
tense, etc. For example, r. я знал (I knew) can be
translated as b. аз знаех, but also as b. знаех. On the
other hand, r. он герой ('he is a hero', lit. 'he hero')
is translated as b. той е герой (lit. 'he is hero').
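The text does not spell out how AER is computed from sure and possible links. The sketch below assumes the commonly used definition from Och and Ney, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the set of predicted links, S the sure links and P the possible links (taken to include S). Links are modelled as (source index, target index) pairs, and the example indices are invented.

```python
def aer(predicted: set, sure: set, possible: set) -> float:
    """Alignment error rate; sure links are counted as possible links too."""
    possible = possible | sure
    if not predicted and not sure:
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))


# Toy gold standard: two sure links and one possible link,
# e.g. a preposition attached to the noun it introduces.
sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
predicted = {(0, 0), (1, 2), (2, 2)}

print(aer(predicted, sure, possible))
```

Under this definition a predicted link that hits a possible link is not penalised as an error, but it does not count towards recall of the sure links either, which is why marking "hanging" words as possible rather than sure makes the gold standard more forgiving.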

4.2.2 Competitive Linking 

Figure 2 shows the AER for competitive linking with all 7
similarity measures: our orthographic and semantic
measures (MMEDR and WEB-ONLY), the three combinations
(WEB-CUT, WEB-MAX and WEB-AVG), as well as for LCSR and
MEDR. We can see an improvement of up to 6 AER points when
going from LCSR/MEDR/WEB-ONLY to WEB-CUT/WEB-AVG. Note
that here we calculated the AER on a modified version of
the 100 gold standard sentences: the stopwords and the
punctuation were removed in order to ensure a fair
comparison with competitive linking, which ignores them.
In addition, each of the measures has its own threshold θ
for competitive linking (see section 2.4), which we set by
optimising perplexity on the training set, as we did for α
in section 4.1: we tried all values of
θ ∈ {0.05, 0.10, ..., 1.00}, and we selected the one which
yielded the lowest perplexity.
Fig. 2: AER for competitive linking: stopwords
and punctuation are not considered.



4.2.3 IBM Models 

In our next experiment, we first extracted a list of the
distinct word pairs aligned with competitive linking, and
we added them twice as additional "sentence" pairs to the
training corpus, as in section 4.1. We then generated two
directed IBM model 4 word alignments (Bulgarian → Russian,
Russian → Bulgarian) for the new corpus, and we combined
them using the intersect+grow heuristic [22]. Figure 3
shows the AER for these combined alignments. We can see
that while training on the augmented corpus lowers AER by
about 4 points compared to the baseline (which is trained
on the original corpus), there is little difference
between the similarity measures.
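The corpus augmentation step itself is mechanically simple and can be sketched as below. The corpus is assumed to be a list of (source tokens, target tokens) pairs; the function name and the toy data are illustrative, and the actual IBM model training is outside the scope of the sketch.

```python
def augment_corpus(sentence_pairs, aligned_word_pairs, copies=2):
    """Append each distinct aligned word pair `copies` times as a one-word 'sentence' pair."""
    distinct = sorted(set(aligned_word_pairs))
    extra = [([src], [tgt]) for src, tgt in distinct for _ in range(copies)]
    return list(sentence_pairs) + extra


# Toy example: one real sentence pair plus word pairs from competitive linking.
corpus = [(["он", "герой"], ["той", "е", "герой"])]
word_pairs = [("герой", "герой"), ("он", "той"), ("герой", "герой")]

augmented = augment_corpus(corpus, word_pairs)
for src, tgt in augmented:
    print(" ".join(src), "|||", " ".join(tgt))
```

Adding each pair more than once presumably strengthens its influence on the co-occurrence statistics that the IBM models estimate, without changing the training procedure itself.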
Fig. 3: AER for IBM model 4: intersect+grow.

Fig. 4: Translation quality: Bleu score.



4.3 Machine Translation 

As we said in the introduction, word alignments are 
an important first step in the process of building a 
phrase-based SMT. However, as many researchers have 
reported, better AER does not necessarily mean im- 
proved machine translation quality [2]. Therefore, we 
built a full Russian → Bulgarian SMT system in order
to assess the actual impact of the corpus augmentation 
(as described in the previous section) on the transla- 
tion quality. 

Starting with the symmetrised word alignments de- 
scribed in the previous section, we extracted phrase- 
level translation pairs using the alignment template ap- 
proach [13]. We then trained a log-linear model with 
the standard feature functions: language model prob- 
ability, word penalty, distortion cost, forward phrase 
translation probability, backward phrase translation 
probability, forward lexical weight, backward lexical 
weight, and phrase penalty. The feature weights were
set by maximising Bleu [24] on the development set 
using minimum error rate training [21]. 

Figures 4 and 5 show the evaluation on the test set in
terms of Bleu and NIST scores. We can see a sizable
difference between the different similarity measures: the
combined measures (WEB-CUT, WEB-MAX and WEB-AVG) clearly
outperform LCSR and MEDR. MMEDR outperforms them as well,
but the difference from LCSR is negligible.



5 Related Work 

Many researchers have exploited the intuition that 
words in two different languages with similar or identi- 
cal spelling are likely to be translations of each other. 



Al-Onaizan et al. [1] create improved Czech-English word
alignments using probable cognates extracted with one of
the variations of LCSR [18] described in [32]. They tried
to constrain the co-occurrences and to seed the parameters
of IBM model 1, but their best results were achieved by
simply adding the cognates to the training corpus as
additional "sentences". Using a variation of that
technique, Kondrak, Marcu and Knight [15] demonstrated
improved translation quality for nine European languages.
We extend this work by adding competitive linking [19],
language-specific weights, and a Web-based semantic
similarity measure.

Koehn & Knight [12] describe several techniques for
inducing translation lexicons. Starting with unrelated 
German and English corpora, they look for (1) identi- 
cal words, (2) cognates, (3) words with similar frequen- 
cies, (4) words with similar meanings, and (5) words 
with similar contexts. This is a bootstrapping process, 
where new translation pairs are added to the lexicon 
at each iteration. 

Rapp [27] describes a correlation between the co- 
occurrences of words that are translations of each 
other. In particular, he shows that if in a text in 
one language two words A and B co-occur more of- 
ten than expected by chance, then in a text in an- 
other language the translations of A and B are also 
likely to co-occur frequently. Based on this observa- 
tion, he proposes a model for finding the most accurate 
cross-linguistic mapping between German and English 
words using non-parallel corpora. His approach differs 
from ours in the similarity measure, the text source, 
and the addressed problem. In later work on the same 
problem, Rapp [28] represents the context of the target 
word with four vectors: one for the words immediately 
preceding the target, another one for the ones immedi- 
ately following the target, and two more for the words 
one more word before/after the target. 

Fung and Yee [7] extract word-level translations 



from non-parallel corpora. They count the number of 
sentence-level co-occurrences of the target word with 
a fixed set of "seed" words in order to rank the candi- 
dates in a vector-space model using different similarity 
measures, after normalisation and TF.iDF-weighting 
[31]. The process starts with a small initial set of 
seed words, which are dynamically augmented as new 
translation pairs are identified. We do not have a 
fixed set of seed words, but generate it dynamically, 
since finding the number of co-occurrences of the tar- 
get word with each of the seed words would require 
prohibitively many search engine queries. 

Diab & Finch [6] propose a statistical word-level 
translation model for comparable corpora, which finds 
a cross-linguistic mapping between the words in the 
two corpora such that the source language word-level 
co-occurrences are preserved as closely as possible. 

Finally, there is a lot of research on string similarity
which has been or potentially could be applied to cognate
identification: Ristad&Yianilos'98 [29] learn the MED
weights using a stochastic transducer. Tiedemann'99 [32]
and Mulloni&Pekar'06 [20] learn spelling changes between
two languages for LCSR and for NEDR respectively.
Kondrak'05 [14] proposes longest common prefix ratio, and
longest common subsequence formula, which counters LCSR's
preference for short words. Klementiev&Roth'06 [10] and
Bergsma&Kondrak'07 [3] propose discriminative frameworks
for string similarity. Brill&Moore'00 [4] learn
string-level substitutions.

6 Conclusion and Future Work 

We have proposed and demonstrated the potential of 
a novel method for improving word alignments using 
linguistic knowledge and the Web as a corpus. 

There are many things we plan to do in the future. First,
we would like to replace competitive linking with maximum
weight bipartite matching. We also want to improve MMED by
adding more linguistic knowledge or by learning the NEDR
or LCSR weights automatically as described in
[20, 29, 32]. Even better results could be achieved with
string-level substitutions [4] or a discriminative
approach [3, 10].